## Core concepts

1. Distributed Systems Basics
   1. What is a node, cluster, master/slave vs peer-to-peer, task scheduling
   2. Concepts like latency, throughput, fault tolerance, and synchronization
2. Networking Basics
   1. TCP/IP, sockets, message passing
   2. Protocols like HTTP, gRPC, or MPI
3. Parallelism vs Distribution
   1. Understand how multithreading/multiprocessing (e.g. OpenMP, multiprocessing in Python) differs from distributed systems
   2. Learn how tasks are coordinated across machines

### Topics to Explore

- MPI basics: `mpirun`, `MPI_Send`, `MPI_Recv`, collective ops
- Sockets and networking protocols
- Load balancing and distributed job scheduling
- CUDA-aware MPI or NCCL
- Fault tolerance & resilience (optional but good for production)

### Python Tools

1. MPI for Python (mpi4py): Most mature option for distributed parallelism in scientific computing; wraps MPI (Message Passing Interface)

### C/C++ Tools

1. MPI (e.g. OpenMPI or MPICH): Industry standard for C/C++ distributed computing
2. ZeroMQ or nanomsg: For more flexible messaging between C/C++ apps
3. gRPC: Modern, performant way to do cross-language RPC (great for C++ ↔ Python communication)

## Recommended First Steps

1. Learn mpi4py and run an MPI-based Python script across two machines on your network.
2. Implement basic socket-based message passing in Python and C.
3. Later, replace CPU computation with CUDA kernels and integrate NCCL or MPI for GPU-to-GPU communication.
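
As a starting point for the socket-based message passing step, here is a minimal Python sketch using only the standard library. It assumes a simple framing scheme that is not specified in these notes: each message is prefixed with a 4-byte big-endian length. The helper names (`send_msg`, `recv_msg`, `recv_exact`, `serve_once`, `demo`) are hypothetical, and the "remote" node is just a thread on `127.0.0.1` so the example runs on one machine.

```python
import socket
import struct
import threading

# Assumed framing: a 4-byte big-endian length prefix before each payload.
HEADER = struct.Struct("!I")


def recv_exact(sock, n):
    """Read exactly n bytes; TCP may deliver a message in several pieces."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        buf.extend(chunk)
    return bytes(buf)


def send_msg(sock, payload):
    """Send one length-prefixed message."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_msg(sock):
    """Receive one length-prefixed message."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)


def serve_once(listener):
    """Accept a single connection and reply with the upper-cased request."""
    conn, _addr = listener.accept()
    with conn:
        request = recv_msg(conn)
        send_msg(conn, request.upper())


def demo():
    # Port 0 asks the OS for any free port; read back the one it chose.
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]

    # The "remote node" here is a local thread standing in for a second machine.
    worker = threading.Thread(target=serve_once, args=(listener,))
    worker.start()

    with socket.create_connection(("127.0.0.1", port)) as client:
        send_msg(client, b"hello from node A")
        reply = recv_msg(client)

    worker.join()
    listener.close()
    return reply


if __name__ == "__main__":
    print(demo())
```

To try this across two machines, one option is to run the listener part on one host (bound to an address reachable on your network, with a fixed port) and the client part on the other. The length prefix matters because TCP is a byte stream and does not preserve message boundaries on its own, which is one of the things MPI and gRPC handle for you.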